Process Documents from Files (Text Processing)

Synopsis
Generates word vectors from a text collection stored in multiple files.
Input
word list: The word list port.
Output
example set (Data table): The example set port.
word list: The word list port.
Parameters
- text directories: Arbitrary directories can be specified in this list. All files matching the given file ending will be loaded and assigned to the class value provided with the directory.
- file pattern: A pattern for the files to be read. The usual wildcards ? and * are supported.
- extract text only: If checked, structural information such as XML or HTML tags will be ignored and discarded.
- use file extension as type: If checked, the type of each file will be determined by its extension. Unknown extensions will be treated as text files.
- content type: The content type of the input texts.
- encoding: The encoding used for reading or writing files.
- create word vector: If checked, the tokens of a document will be used to generate a vector that represents the document numerically.
- vector creation: Selects the schema for creating the word vector.
- add meta information: If checked, available meta information about the text, such as file name and date, is added as attributes.
- keep text: If checked, the input text will be stored as a special String attribute with the role text.
- prune method: Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
- prune below percent: Ignore words that appear in less than this percentage of all documents.
- prune above percent: Ignore words that appear in more than this percentage of all documents.
- prune below absolute: Ignore words that appear in fewer than this many documents.
- prune above absolute: Ignore words that appear in more than this many documents.
- prune below rank: Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage will be pruned.
- prune above rank: Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage will be pruned.
- datamanagement: Determines how the data is represented internally.
- parallelize vector creation: Determines whether the execution of vector creation should be parallelized.
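
To make the interplay of the file pattern, percentage pruning, and word vector creation more concrete, the following Python sketch imitates the idea on a few in-memory texts. It is not the operator's implementation. The tokenizer, the function names, the sample texts, and the choice of plain term occurrences as the vector schema are all assumptions made for illustration; the parameter descriptions above do not list the available schemas.

```python
import fnmatch
import re
from collections import Counter


def matches_pattern(filename, file_pattern):
    """Return True if the file name matches a wildcard pattern such as '*.txt'."""
    return fnmatch.fnmatchcase(filename, file_pattern)


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_word_list(token_lists, below_percent=0.0, above_percent=100.0):
    """Keep words whose document percentage lies within the two bounds."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    kept = []
    for word, count in doc_freq.items():
        percent = 100.0 * count / n_docs
        # Words below the lower bound or above the upper bound are ignored.
        if below_percent <= percent <= above_percent:
            kept.append(word)
    return sorted(kept)


def term_occurrence_vectors(token_lists, word_list):
    """One vector per document: how often each word of the word list occurs."""
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in word_list])
    return vectors


# Hypothetical input: (class value of the directory, file name, file content).
files = [
    ("sports", "a.txt", "The team won the match"),
    ("sports", "b.txt", "The match was a draw"),
    ("politics", "c.txt", "The vote was close"),
    ("politics", "notes.md", "ignored by the pattern"),
]

# Only files matching the pattern are loaded; each keeps its directory's class.
selected = [f for f in files if matches_pattern(f[1], "*.txt")]
labels = [label for label, _, _ in selected]
token_lists = [tokenize(text) for _, _, text in selected]

word_list = build_word_list(token_lists, below_percent=50.0, above_percent=90.0)
vectors = term_occurrence_vectors(token_lists, word_list)

for label, vector in zip(labels, vectors):
    print(label, dict(zip(word_list, vector)))
```

With these sample texts, "the" occurs in every document and is pruned by the upper bound, while words that occur in only one of the three documents fall below the lower bound. Only "match" and "was" remain in the word list, so each document is represented by a two-element vector.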